This is Jacob Cram’s analysis of the Tidy Tuesday from 31 March 2020 that he did at R club on 03 April 2019
Load in the data the fast way. One could also bring in all of the tidy Tuesday data
brewing_materials <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewing_materials.csv')
Parsed with column specification:
cols(
data_type = [31mcol_character()[39m,
material_type = [31mcol_character()[39m,
year = [32mcol_double()[39m,
month = [32mcol_double()[39m,
type = [31mcol_character()[39m,
month_current = [32mcol_double()[39m,
month_prior_year = [32mcol_double()[39m,
ytd_current = [32mcol_double()[39m,
ytd_prior_year = [32mcol_double()[39m
)
beer_taxed <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_taxed.csv')
Parsed with column specification:
cols(
data_type = [31mcol_character()[39m,
tax_status = [31mcol_character()[39m,
year = [32mcol_double()[39m,
month = [32mcol_double()[39m,
type = [31mcol_character()[39m,
month_current = [32mcol_double()[39m,
month_prior_year = [32mcol_double()[39m,
ytd_current = [32mcol_double()[39m,
ytd_prior_year = [32mcol_double()[39m,
tax_rate = [31mcol_character()[39m
)
brewer_size <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/brewer_size.csv')
Parsed with column specification:
cols(
year = [32mcol_double()[39m,
brewer_size = [31mcol_character()[39m,
n_of_brewers = [32mcol_double()[39m,
total_barrels = [32mcol_double()[39m,
taxable_removals = [32mcol_double()[39m,
total_shipped = [32mcol_double()[39m
)
beer_states <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-31/beer_states.csv')
Parsed with column specification:
cols(
state = [31mcol_character()[39m,
year = [32mcol_double()[39m,
barrels = [32mcol_double()[39m,
type = [31mcol_character()[39m
)
Here I look at the abundnace of malt and malt products over the couse of the timeseries.
jacob_brewing_materials <- brewing_materials %>% filter(type == "Malt and malt products") %>% mutate(sortaDate = year + (month-.5)/12)
jacob_brewing_materials %>% ggplot(aes(x = sortaDate, y = month_current)) + geom_point()
Look at all of the grain products from the whole time series.
jacob_brewing_materials <- brewing_materials %>% filter(material_type == "Grain Products") %>% mutate(sortaDate = year + (month-.5)/12)
jacob_brewing_materials %>% ggplot(aes(x = sortaDate, y = month_current)) + geom_point() + facet_wrap(~type, scales = "free")
Gran and non grain together.
jacob_brewing_materials <- brewing_materials %>% mutate(sortaDate = year + (month-.5)/12) %>% filter(material_type %in% c("Grain Products", "Non-Grain Products"))
plt <- jacob_brewing_materials %>% ggplot(aes(x = sortaDate, y = month_current, col = type)) + geom_point() + facet_wrap(~material_type)
plt
Plotly allows me to mouse over points and see what they represent. I can slso do things like zoom into this plot.
#library(plotly)
ggplotly(plt)
We wanted to look at seasonal patterns. Here I plot vs month and use color for year. Malt and malt products only.
jacob_brewing_materials <- brewing_materials %>% filter(type == "Malt and malt products", year < 2015) %>% mutate(sortaDate = year + (month-.5)/12)
plt <- jacob_brewing_materials %>% ggplot(aes(x = month, y = month_current, col = year)) + geom_point() + scale_color_viridis_c()
ggplotly(plt)
We decided to model malt and malt products as a seasonal sin function and then also look for long term effects.
jacob_brewing_materials <- jacob_brewing_materials %>% mutate(klausSin = sin(2 * pi * sortaDate), klausCos = cos(2 * pi * sortaDate))
jacob_malt <- jacob_brewing_materials %>% filter(type == "Malt and malt products", year < 2015)
mod <- lm(month_current ~ klausSin + klausCos + sortaDate,data = jacob_malt)
summary(mod)
Call:
lm(formula = month_current ~ klausSin + klausCos + sortaDate,
data = jacob_malt)
Residuals:
Min 1Q Median 3Q Max
-118371947 -9869121 265592 10809449 38606633
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.834e+10 2.266e+09 8.096 5.25e-12 ***
klausSin 1.308e+07 3.218e+06 4.063 0.000113 ***
klausCos -3.735e+07 3.198e+06 -11.679 < 2e-16 ***
sortaDate -8.948e+06 1.126e+06 -7.944 1.04e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 20730000 on 80 degrees of freedom
Multiple R-squared: 0.7372, Adjusted R-squared: 0.7273
F-statistic: 74.8 on 3 and 80 DF, p-value: < 2.2e-16
#plot(mod$fit, )
There are both seasonal and long term effects. If we include the 2015 data, where everthing is a lot lower (perhaps because the data weren’t all in from that year yet), we don’t have these nice effects.
Predict estemates the value from the model. We plot model predictions (lines) vs actual values (points)
jacob_malt_2 <- jacob_malt %>% mutate(pred = predict(mod, jacob_malt))
jacob_malt_2 %>% ggplot(aes(x = sortaDate, y = pred)) + geom_line() + geom_point(aes(y = month_current))
Look at the residuals of the model.
jacob_malt_2 %>% ggplot(aes(x = sortaDate, y = pred - month_current)) + geom_point()
And that all we had time for.